Author: Charles Tapley Hoyt
This notebook outlines the systematic assessment of errors in a BEL document. The data used are from the Alzheimer's Disease (AD) knowledge assembly model, which has been annotated with the NeuroMMSig Database. Error analysis is not meant to place blame on contributors to a BEL document, but rather to show curation leaders where recuration efforts should be focused and to make analysts aware of the issues in a given BEL document.
In [1]:
import logging
import os
import re
import time
from collections import Counter, defaultdict
from operator import itemgetter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from fuzzywuzzy import process, fuzz
from matplotlib_venn import venn2
import pybel
from pybel.constants import PYBEL_DATA_DIR
from pybel.manager.cache import CacheManager
from pybel.parser import MetadataParser
import pybel_tools as pbt
from pybel_tools.utils import barh, barv
In [2]:
%config InlineBackend.figure_format = 'svg'
%matplotlib inline
In [3]:
logging.getLogger('pybel.cache').setLevel(logging.CRITICAL)
In [4]:
time.asctime()
Out[4]:
In [5]:
pybel.__version__
Out[5]:
In [6]:
pbt.__version__
Out[6]:
To make this notebook interoperable across many machines, the locations of the repositories containing the data used in this notebook are read from environment variables, which should be set in ~/.bashrc to point to the place where the repositories have been cloned. Assuming the repositories have been git clone'd into the ~/dev folder, the entries in ~/.bashrc should look like:
...
export BMS_BASE=~/dev/bms
export BANANA_BASE=~/dev/banana
export PYBEL_RESOURCES_BASE=~/dev/pybel-resources
...
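If one of these variables is unset, a direct os.environ lookup raises a KeyError. A minimal sketch of a fallback is shown below; the resolve_repo helper is hypothetical (not part of PyBEL), and the ~/dev default mirrors the convention above.

```python
import os

# Hypothetical helper (not part of PyBEL): resolve a repository location from
# an environment variable, falling back to a conventional clone location
# under ~/dev when the variable is unset.
def resolve_repo(var_name, default_subdir):
    default = os.path.join(os.path.expanduser('~'), 'dev', default_subdir)
    return os.environ.get(var_name, default)

bms_base = resolve_repo('BMS_BASE', 'bms')
```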
The biological model store (BMS) is the internal Fraunhofer SCAI repository for keeping BEL models under version control. It can be downloaded from https://tor-2.scai.fraunhofer.de/gf/project/bms/
In [7]:
bms_base = os.environ['BMS_BASE']
PyBEL Resources is a set of namespaces and annotations that have been made available by the AETIONOMY project through a collaboration between the PyBEL Core team and the NeuroMMSig Database developers. It can be downloaded from https://github.com/pybel/pybel-resources
In [8]:
pybel_resources_base = os.environ['PYBEL_RESOURCES_BASE']
The Alzheimer's Disease Knowledge Assembly has been precompiled with the following command line script, and will be loaded from this format for improved performance. In general, derived data, such as the gpickle representation of a BEL script, are not saved under version control to ensure that the most up-to-date data is always used.
pybel convert --path "$BMS_BASE/aetionomy/alzheimers.bel" --pickle "$BMS_BASE/aetionomy/alzheimers.gpickle"
The BEL script can also be compiled from inside this notebook with the following python code:
>>> import os
>>> import pybel
>>> # Input from BEL script
>>> bel_path = os.path.join(bms_base, 'aetionomy', 'alzheimers.bel')
>>> graph = pybel.from_path(bel_path)
>>> # Output to gpickle for fast loading later
>>> pickle_path = os.path.join(bms_base, 'aetionomy', 'alzheimers.gpickle')
>>> pybel.to_pickle(graph, pickle_path)
In [9]:
pickle_path = os.path.join(bms_base, 'aetionomy', 'alzheimers', 'alzheimers.gpickle')
In [10]:
graph = pybel.from_pickle(pickle_path)
In [11]:
graph.version
Out[11]:
As stated in the pybel.BELGraph documentation, all warnings generated during BEL compilation are stored in a list. Each entry contains information about the BEL statement, the line number, the type of error, and the annotations present in the parser at the time of the error. PyBEL Tools makes many functions available for systematically analyzing these errors.
The total number of errors is listed below.
In [12]:
len(graph.warnings)
Out[12]:
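Individual warnings can also be inspected directly. The sketch below assumes each warning is a 4-tuple of (line number, BEL line, exception, context), which matches the PyBEL version used here but may differ in others; fabricated data with stand-in stdlib exceptions are used in place of PyBEL's parse exceptions.

```python
from collections import Counter

# Fabricated stand-ins for graph.warnings; real entries carry PyBEL parse
# exceptions such as MissingNamespaceNameWarning.
mock_warnings = [
    (12, 'p(HGNC:APP)', NameError('missing name'), {}),
    (40, 'p(XYZ:abc)', KeyError('undefined namespace'), {}),
    (55, 'p(HGNC:APPP)', NameError('missing name'), {}),
]

# Tally the warnings by exception class, which is essentially what
# count_error_types computes.
error_types = Counter(type(exc).__name__ for _, _, exc, _ in mock_warnings)
```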
The types of errors in a graph and their frequencies can be calculated using pbt.summary.count_error_types.
In [13]:
error_counter = pbt.summary.count_error_types(graph)
In [14]:
barh(error_counter, plt)
A common type of error is using a name that is not contained within its namespace. These errors are raised as pybel.parser.parse_exceptions.MissingNamespaceNameWarning. The PyBEL Tools function pbt.summary.calculate_incorrect_name_dict builds a dictionary mapping each namespace to the incorrect names used with it and their frequencies.
In [15]:
incorrect_name_dict = pbt.summary.calculate_incorrect_name_dict(graph)
Using pbt.utils.count_dict_values, the number of unique incorrect names for each namespace is extracted and plotted.
In [16]:
barh(pbt.utils.count_dict_values(incorrect_name_dict), plt)
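Incorrect names can often be repaired by fuzzy matching against the namespace, which is presumably why fuzzywuzzy is imported above. A dependency-free sketch of the same idea using the standard library's difflib is shown below; the namespace contents and the suggest helper are fabricated for illustration.

```python
import difflib

# Fabricated excerpt of a namespace's valid names.
valid_hgnc_names = ['APP', 'MAPT', 'PSEN1', 'PSEN2', 'APOE']

# Hypothetical helper: suggest the closest valid name for a misspelled one,
# or None if nothing is similar enough.
def suggest(incorrect_name, valid_names, cutoff=0.6):
    matches = difflib.get_close_matches(incorrect_name, valid_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

suggestion = suggest('APPP', valid_hgnc_names)
```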
Another common error is writing an identifier without a namespace, or, for short, a naked name. The function pbt.summary.count_naked_names returns a counter of how many times each naked name appeared.
In [17]:
naked_names = pbt.summary.count_naked_names(graph)
The number of unique naked names can be calculated directly by calling len() on the counter returned by pbt.summary.count_naked_names.
In [18]:
len(naked_names)
Out[18]:
The 25 most common naked names are output below.
In [19]:
naked_names.most_common(25)
Out[19]:
Overall, occurrences of the same error are grouped together to identify the most frequent errors.
In [20]:
error_groups = pbt.summary.group_errors(graph)
error_group_counts = Counter({k: len(v) for k, v in error_groups.items()})
error_group_counts.most_common(24)
Out[20]:
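Conceptually, this grouping maps each offending BEL line to the line numbers where it appeared. A minimal sketch on fabricated warnings:

```python
from collections import defaultdict

# Fabricated (line number, BEL line) pairs standing in for parsed warnings.
mock_warnings = [
    (10, 'p(XYZ:abc)'),
    (25, 'p(XYZ:abc)'),
    (31, 'p(HGNC:APPP)'),
]

# Group repeated errors: each offending line maps to all the line numbers
# on which it occurred.
groups = defaultdict(list)
for line_number, bel_line in mock_warnings:
    groups[bel_line].append(line_number)
```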
It might be useful to group the errors by a certain annotation/value pair. In these examples, the NeuroMMSig Database subgraph annotations are used. Ultimately, the error frequency will be compared to the size of each subgraph.
First, the sizes of the top 30 largest subgraphs are shown below.
In [21]:
size_by_subgraph = pbt.summary.count_annotation_values(graph, 'Subgraph')
In [22]:
plt.figure(figsize=(10, 3))
barv(dict(size_by_subgraph.most_common(30)), plt)
plt.yscale('log')
plt.title('Top 30 Subgraph Sizes')
plt.show()
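Under the hood, counting annotation values amounts to iterating over the edges' data dictionaries. The sketch below uses fabricated edge data and assumes annotations are stored under an 'annotations' key, which is an assumption about PyBEL's internal edge format.

```python
from collections import Counter

# Fabricated edge data dictionaries; real ones come from graph.edges.
mock_edges = [
    {'annotations': {'Subgraph': 'Amyloidogenic subgraph'}},
    {'annotations': {'Subgraph': 'Tau protein subgraph'}},
    {'annotations': {'Subgraph': 'Amyloidogenic subgraph'}},
    {'annotations': {}},  # unannotated edges are skipped
]

# Count how many edges carry each value of the 'Subgraph' annotation.
size_counter = Counter(
    d['annotations']['Subgraph']
    for d in mock_edges
    if 'Subgraph' in d['annotations']
)
```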
The list of all errors for each subgraph can be calculated with pbt.summary.calculate_error_by_annotation.
In [23]:
error_by_subgraph = pbt.summary.calculate_error_by_annotation(graph, 'Subgraph')
These data are aggregated with pbt.utils.count_dict_values, which counts the number of items in each list. The top 30 most error-prone subgraphs are shown below.
In [24]:
error_by_subgraph_count = pbt.utils.count_dict_values(error_by_subgraph)
plt.figure(figsize=(10, 3))
barv(dict(error_by_subgraph_count.most_common(30)), plt)
plt.yscale('log')
plt.ylabel('Errors')
plt.title('Top 30 Most Error-Prone Subgraphs')
plt.show()
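The aggregation step reduces a dictionary of lists to a counter of list lengths; a sketch on fabricated subgraph error lists:

```python
from collections import Counter

# Fabricated mapping from subgraph name to the list of errors it contains.
error_lists = {
    'Amyloidogenic subgraph': ['err1', 'err2', 'err3'],
    'Tau protein subgraph': ['err1'],
}

# Count the items in each list, as count_dict_values does.
error_counts = Counter({k: len(v) for k, v in error_lists.items()})
```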
Finally, the error-to-size ratio is calculated for each subgraph below. The 25 subgraphs with the highest error-to-size ratio are shown.
In [25]:
subgraphs = sorted(size_by_subgraph)
df_data = [(size_by_subgraph[k], error_by_subgraph_count[k], error_by_subgraph_count[k] / size_by_subgraph[k]) for k in subgraphs]
df = pd.DataFrame(df_data, index=subgraphs, columns=['Size', 'Errors', 'E/S Ratio'])
df.to_csv('~/Desktop/errors.tsv', sep='\t')
df.sort_values('E/S Ratio', ascending=False).head(25)
Out[25]:
The overall distribution of subgraph sizes and error counts is shown below. It indicates a positive correlation between the two, but there are clear outliers: some large subgraphs show careful curation, while some smaller ones show sloppy curation.
In [26]:
sns.lmplot('Size', 'Errors', data=df)
plt.title('BEL Errors as a function of Subgraph Size')
plt.show()
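The trend in the scatter plot can also be quantified with a Pearson correlation coefficient. A minimal, dependency-free sketch on fabricated size and error counts (the real values come from the DataFrame above):

```python
# Fabricated subgraph sizes and error counts standing in for df['Size']
# and df['Errors'].
sizes = [10, 50, 100, 200]
errors = [1, 4, 9, 20]

# Pearson correlation: covariance normalized by the product of the
# standard deviations.
n = len(sizes)
mean_s = sum(sizes) / n
mean_e = sum(errors) / n
cov = sum((s - mean_s) * (e - mean_e) for s, e in zip(sizes, errors))
var_s = sum((s - mean_s) ** 2 for s in sizes)
var_e = sum((e - mean_e) ** 2 for e in errors)
r = cov / (var_s * var_e) ** 0.5
```

With real data, the same quantity is available directly as df['Size'].corr(df['Errors']).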
The BEL language lacks a utility for curator provenance. When multiple curators contribute to a single document, it is often difficult to trace where errors originated and to identify the individual responsible for fixing them. When curation work is contracted, error analysis is also crucial for assessing its quality. PyBEL makes error messages programmatically accessible and easy to summarize.
These functions were used to build a web interface that gives feedback to BEL curators who are not comfortable with programming. The code and deployment instructions are available at https://github.com/cthoyt/pybel-web-validator.